
Record: SP8192 + Strict Full-Val Byte PPM Mixture — 1.00495 BPB (3-seed mean) #1850

Open
someone114514 wants to merge 1 commit into openai:main from someone114514:sp8192-strict-fullval-ppm-0426

Conversation

@someone114514

Summary

3-seed mean val_bpb 1.00495 (std 0.00072). Best/min seed is 1.00425333 BPB (seed 1337). Compared to the merged 2026-04-09 SP8192 legal TTT record at 1.0810 BPB, this improves by 0.0761 BPB, comfortably past the 0.005-nat threshold and over 100x the observed inter-seed std. All three artifacts stay under the 16 MB cap.

The submission adds one scoring component on top of the existing SP8192 training stack: a binary-lambda-gated PPM-D byte-level mixture applied to the sliding-window NN log-probs at eval time. The mixture is constructed to fit the score-before-update discipline: each byte is scored from the prefix PPM state, then inserted into the PPM counts for future bytes.

| metric | value |
| --- | --- |
| val_bpb (PPM mixture, 3-seed mean) | 1.00495 |
| std across seeds | 0.00072 |
| best/min seed | 1.00425333 |
| improvement vs base legal TTT (1.0810) | 0.0761 BPB |
| training | 8xH100 SXM, 600s cap, ITERATIONS=20000 |
| eval | sliding-window stride=64 + strict full-val byte PPM mixture, under 600s |
| total_submission_bytes_max | 15,997,433 |
| cap margin at max artifact | 2,567 B |
| seeds run | 42, 7, 1337 |

The Contribution

A binary-lambda-gated PPM-D mixture over an already-scored byte stream, computed at eval time and combined with the NN's per-byte probabilities (exponentiated from its log-probs) in probability space.

For each predicted byte at position t:

  1. NN probability: the per-token NN NLL from the existing causal sliding-window evaluation is deterministically spread uniformly over the bytes emitted by that target token. This uses the already-computed sliding NLLs; there are no extra NN forward passes.
  2. PPM probability: classical byte-level PPM-D style scoring over the 256-byte alphabet. Counts are built online from already-scored validation bytes only. No future bytes are read.
  3. Gate: the binary mixture lambda is selected from prefix context confidence before observing the current byte. If the deepest available context is highly confident, the gate picks lambda_lo=0.05 (mostly trusting PPM); otherwise it picks lambda_hi=0.9 (mostly trusting the NN).
  4. Mix: p_mix = lambda * p_NN + (1 - lambda) * p_PPM, then -log(p_mix) contributes to byte BPB.
  5. Update: PPM counts are incremented only after the byte's mixed log-probability is recorded.

The implementation uses PPM_ORDER=4, PPM_LAMBDA_HI=0.9, PPM_LAMBDA_LO=0.05, and PPM_CONF_THRESHOLD=0.9 in the submitted logs.
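
For concreteness, a minimal self-contained Python sketch of the score-before-update loop and the binary gate follows. It is only an illustration of the discipline described above, not the submitted scorer (which is native C); a flat add-one-smoothed order-0 count table stands in for the real order-4 PPM-D state, and the confidence rule is a simplified stand-in.

```python
import math
from collections import Counter

# Gate parameters as reported in the submitted logs.
LAMBDA_HI, LAMBDA_LO, CONF_THRESHOLD = 0.9, 0.05, 0.9

def score_stream(byte_stream, nn_byte_probs):
    """Score each byte from the prefix state, then update (score-before-update).

    nn_byte_probs[t] is the NN probability assigned to byte t, obtained by
    spreading each token's NLL uniformly over its bytes: p = exp(-token_nll / n_bytes).
    A flat order-0 count table stands in for the real order-4 PPM-D state.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for t, b in enumerate(byte_stream):
        # PPM-side probability from already-scored bytes only (add-one smoothed here).
        p_ppm = (counts[b] + 1.0) / (seen + 256.0)
        # Prefix-only confidence: mass of the most frequent continuation so far.
        conf = (max(counts.values()) / seen) if seen else 0.0
        lam = LAMBDA_LO if conf >= CONF_THRESHOLD else LAMBDA_HI
        p_mix = lam * nn_byte_probs[t] + (1.0 - lam) * p_ppm
        total_bits += -math.log2(p_mix)          # score first ...
        counts[b] += 1                           # ... update only afterwards
        seen += 1
    return total_bits / max(len(byte_stream), 1)  # bits per byte

# Toy usage: a repetitive byte stream where the count model quickly gains confidence.
data = b"abcabcabcabc"
nn_probs = [0.02] * len(data)                    # stand-in NN per-byte probabilities
print(score_stream(data, nn_probs))
```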

Why this helps here: the parameter-constrained SP8192 NN still has a byte-level surprisal floor on highly repetitive local byte contexts such as identifiers, URLs, numeric literals, and repeated formatting fragments. PPM is strong exactly in those high-confidence local contexts. The binary gate is intentionally conservative: it trusts PPM only when the prefix counts indicate a strong local continuation, and otherwise falls back toward the NN.

Per-Seed Results

| Seed | Pre-Quant / Post-EMA | PPM mix | Artifact (B) |
| --- | --- | --- | --- |
| 1337 | 1.08627037 | 1.00425333 | 15,993,603 |
| 42 | 1.08711004 | 1.00489563 | 15,997,433 |
| 7 | 1.08750246 | 1.00569239 | 15,995,226 |
| Mean | 1.08696 | 1.00495 | 15,995,421 |
| Std | | 0.00072 | |

Three independent seeds, all with ppm_mix < 1.006. The headline number is the PPM mixture returned as quantized_sliding_window val_bpb. The logs also report nn_token_bpb, nn_byte_bpb, and ppm_only for auditability.

Legality / Issue #1017

The PPM mixture is implemented inside a strict score-before-update eval-time path.

| Condition | How this submission satisfies it |
| --- | --- |
| 1. Causality | Sliding-window NN scoring is strictly causal: each token is scored from prefix tokens only. PPM context is the byte prefix of already-scored bytes, never future bytes. |
| 2. Normalized distribution | PPM-D produces a normalized distribution over the 256-byte alphabet through its escape mechanism. The final mixture is in probability space, so it is normalized by construction. The NN side remains the standard softmax over the full vocab. |
| 3. Score before update | NN sliding scores are computed before PPM bytes are updated. Each byte is scored from existing PPM counts and only then inserted into the count tables. |
| 4. Single pass | Each validation byte is scored exactly once, in stream order. There is no rescoring, no multi-pass selection, and no prebuilt validation cache. |
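
A small numeric check of condition 2 can be done with a single-level PPM-D escape blend (no exclusions, for simplicity; this is an illustrative helper, not the submitted scorer): the escape-weighted blend of the seen-symbol mass with any normalized lower-order distribution sums to 1, and a convex combination of two normalized distributions stays normalized.

```python
from collections import Counter

def ppm_d_dist(counts, lower):
    """One level of a PPM-D blend over 256 bytes (no exclusions, for simplicity).

    counts : Counter of byte -> count for the current context
    lower  : list of 256 probabilities that already sums to 1
    """
    n = sum(counts.values())
    if n == 0:
        return list(lower)                       # escape all the way down
    u = len(counts)                              # distinct bytes seen in this context
    p_escape = u / (2.0 * n)                     # PPM-D escape probability
    dist = [p_escape * lower[b] for b in range(256)]
    for b, c in counts.items():
        dist[b] += (c - 0.5) / n                 # discounted seen-symbol mass
    return dist

# Toy check: PPM-D blend is normalized, and so is the lambda mixture with the NN.
order0 = [1.0 / 256] * 256
ctx_counts = Counter({104: 5, 101: 3, 32: 2})    # counts for bytes 'h', 'e', ' '
p_ppm = ppm_d_dist(ctx_counts, order0)
p_nn = [1.0 / 256] * 256                         # stand-in for the NN byte distribution
lam = 0.05
p_mix = [lam * a + (1 - lam) * b for a, b in zip(p_nn, p_ppm)]
assert abs(sum(p_ppm) - 1.0) < 1e-9
assert abs(sum(p_mix) - 1.0) < 1e-9
```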

Additionally:

  • No SLOT.
  • No TTT in the packed artifact.
  • No pre-quant validation adaptation.
  • No ETLB / logit bias.
  • No n-gram cache.
  • No external network access at eval time.
  • PPM state is built fresh inside eval_val_sliding for each run and is not persisted across invocations.
  • Tokenizer is unchanged from the base SP8192 stack.

Implementation Notes

The scorer is native C compiled at runtime with gcc -O3 from the packed script. It uses:

  • open-addressed context tables,
  • rolling byte context keys,
  • inline counts for the first four bytes per context,
  • fixed order-0 byte counts,
  • cached integer logs,
  • precomputed lambda logs,
  • compact raw per-rank token/NLL files in /tmp for distributed sliding collection.
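
The submitted scorer is native C; the Python sketch below only illustrates the rolling-context-key and open-addressed-lookup pattern. The table size, key folding, and dict-per-slot counts are illustrative stand-ins, not the submitted layout.

```python
TABLE_SIZE = 1 << 16          # power of two so probing can use a bitmask

class ContextTable:
    """Open-addressed table keyed by a rolling hash of the recent byte context.

    Illustrative only: no resizing, so it assumes the table never fills.
    """

    def __init__(self):
        self.keys = [None] * TABLE_SIZE
        self.counts = [None] * TABLE_SIZE        # per-context byte count dicts

    def _slot(self, key):
        i = key & (TABLE_SIZE - 1)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) & (TABLE_SIZE - 1)       # linear probing
        return i

    def get(self, key):
        """Return the count dict for this context, or None if unseen."""
        return self.counts[self._slot(key)]

    def bump(self, key, byte):
        """Insert the context if new and increment the count for `byte`."""
        i = self._slot(key)
        if self.keys[i] is None:
            self.keys[i] = key
            self.counts[i] = {}
        self.counts[i][byte] = self.counts[i].get(byte, 0) + 1

def rolling_key(prev_key, byte, order_mask=(1 << 32) - 1):
    """Fold the newest byte into the context key; the 32-bit mask keeps the
    last four bytes, matching an order-4 context (illustrative choice)."""
    return ((prev_key << 8) | byte) & order_mask

# Usage: look up the prefix state (score step), update afterwards, then roll the key.
table = ContextTable()
key = 0
for b in b"the the the":
    ctx_counts = table.get(key)                  # prefix-only lookup for scoring
    table.bump(key, b)                           # update only after scoring
    key = rolling_key(key, b)
```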

The Python PPM reference and eval-time TTT were removed from the packed artifact to keep the submission under the 16 MB cap. Native exactness was checked against the Python reference during development before trimming.

Compliance Numbers

| item | value |
| --- | --- |
| max final_model.int6.ptz | 15,976,001 B |
| packed train_gpt.py | 21,432 B |
| max total submission | 15,997,433 B |
| cap margin | 2,567 B |

All three seeds are under 16,000,000 bytes.

Files

  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_gpt.py
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/submission.json
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/README.md
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_seed1337.log
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_seed42.log
  • records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_seed7.log

Reproduce

```bash
python3 -m pip install brotli sentencepiece
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

RUN_ID=strict_ppm_trim_seed42_8gpu_order4_b32 \
SEED=42 \
PPM_ENABLED=1 \
PPM_NATIVE_ENABLED=1 \
PPM_ORDER=4 \
PPM_LAMBDA_HI=0.9 \
PPM_LAMBDA_LO=0.05 \
PPM_CONF_THRESHOLD=0.9 \
PPM_LOG_CACHE_SIZE=1048576 \
SKIP_QUANTIZED_EVAL=1 \
SLIDING_BATCH_SEQS=32 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_gpt.py
```

Change SEED and RUN_ID for seeds 7 and 1337.

@someone114514 force-pushed the sp8192-strict-fullval-ppm-0426 branch from 304dff5 to 37ce906 on April 27, 2026 06:29
phaniratan1234 pushed a commit to phaniratan1234/parameter-golf that referenced this pull request Apr 27, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 27, 2026
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
@someone114514 force-pushed the sp8192-strict-fullval-ppm-0426 branch from 58bed14 to 37ce906 on April 27, 2026 23:40
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
… notes

spec 052: PPM-D byte mixture port from PR openai#1850 onto 047B + our anti-hijack
gate tuning. Phase 1 measured end-to-end at mix_bpb_sidecar = 1.00506,
matching PR openai#1850's 1.00495 within 0.0001.

spec 055: full submission run — train 050 baseline from scratch, apply same
tuned PPM at eval. Single train_gpt.py file. Predicts 1.005 +/- 0.003.
Code: exp/055-050-with-ppm-fullrun @ c27be23.

ideas:
- ppm-port-on-047B.md — narrative of the PPM port discovery, headroom
  analysis (1850 vs 1857 vs us), and why anti-hijack was the bigger lever.
- ppm-d-mixture-and-anti-hijack.md — full math: per-token NN -> per-byte
  spreading, PPM-D Howard escape-D, the gate (1850 raw + anti-hijack
  override), log-sum-exp mixture, and the 4.32-bit hijack geometry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
Earlier default 4194304 (OMP-chunked) was suboptimal — saves ~230s eval time
but loses ~0.010 BPB sidecar from chunk-reset penalty. PR openai#1850 chose single-
pass deliberately and pays the 252s scoring cost for the bigger gain.

Single-pass timing on 8H per 1850's measurements:
  pre-quant + gptq + ema:      ~85s
  diagnostic quantized eval:   ~60s
  non-overlap forward (8-way): ~20s
  file gather:                 ~5s
  single-pass PPM scoring:     ~250s  (CPU-bound, not GPU)
  ────────────────────────────────────
  total eval phase:            ~420s   under 600s cap

Smokes (where wallclock matters more than gain) can override with
PPM_OMP_CHUNK_TOKENS=4194304.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 28, 2026
WHY: V1 NGramMixer (fixed-order bigram + Dirichlet uniform smoothing) failed
because cold-start q_bi was uniform → mixing in noise. V2 (TempScaler) failed
because the trained NN is already calibrated. The actual large entropy gap
that PPM byte mixture exploits is *local verbatim repetition* (URLs, code
identifiers, repeated phrases) that a 5M-param NN averages over.

WHAT: Cleary-Witten 1984 PPM-D over the SP token alphabet (Σ_token=8192),
with backoff via escape mechanism. Distribution defined on Σ_token resolves
the byte-vs-token C2 dispute (Issue openai#1872) cleanly. Binary λ gate (PR openai#1850
pattern): if PPM confidence at deepest matched context ≥ threshold,
λ=lambda_lo (mostly trust PPM); else λ=lambda_hi (mostly trust NN).

LEGALITY: All four conditions of Issue openai#1017:
  C1: ctx[k] only contains counts from already-scored tokens
  C2: P_K(·|prev) = recursive PPM-D blend, sums to 1 over Σ_token
      (verified by `test_ppm_c2_full_normalized`); convex combination with
      NN softmax preserves normalization
  C3: λ-gate uses confidence at deepest matched context (prev-only),
      computed before observing target. update_stream is called AFTER mix_nll
  C4: monotonic state, single left-to-right pass

VALIDATION: 23/23 unit tests pass on CPU including a functional toy
benchmark — on a chunked synthetic stream with strong repetition motifs, PPM
gives -3.2 nats/token improvement vs NN baseline. (Real FineWeb is much less
repetitive but the byte-level PPM cluster has shown -0.05 to -0.20 BPB
improvements on this challenge, suggesting token-level can capture similar
entropy.)

INTEGRATION: eval_val sub-chunked W=128 (env: PPM_CHUNK_TOKENS) so within-
batch repetition is captured. State carries across batches via the mixer
object. Eval_val_ttt_phased path NOT touched yet (would need per-doc-slot
PPM tables; deferred to V4 if V3 numbers warrant).

ENV: PPM_MIX_ENABLED, PPM_MAX_ORDER (default 2), PPM_LAMBDA_LO (0.05),
PPM_LAMBDA_HI (0.9), PPM_CONF_THRESHOLD (0.9), PPM_CHUNK_TOKENS (128).